(57) ABSTRACT

A software mechanism for enabling a programmer to embed 
selected machine instructions into program source code in a 
convenient fashion, and optionally restricting the 
re-ordering of such instructions by the compiler without 
making any significant modifications to the compiler pro- 
cessing. Using a table-driven approach, the mechanism 
parses the embedded machine instruction constructs and 
verifies syntax and semantic correctness. The mechanism 
then translates the constructs into low-level compiler inter- 
nal representations that may be integrated into other com- 
piler code with minimal compiler changes. When also 
supported by a robust underlying inter-module optimization
framework, library routines containing embedded machine 
instructions according to the present invention can be inlined 
into applications. When those applications invoke such 
library routines, the present invention enables the routines to 
be optimized more effectively, thereby improving run-time 
application performance. A mechanism is also disclosed 
using a "_fpreg" data type to enable floating-point arith- 
metic to be programmed from a source level where the 
programmer gains access to the full width of the floating- 
point register representation of the underlying processor. 

56 Claims, 3 Drawing Sheets 
OPTIMIZATION OF SOURCE CODE WITH
EMBEDDED MACHINE INSTRUCTIONS 

TECHNICAL FIELD OF THE INVENTION 

This invention relates generally to the compilation of 
computer code performing floating-point arithmetic opera- 
tions on floating-point registers, and more particularly to 
enabling, via a new data type, access in source code to the 
full width of the floating point register representation in the 
underlying processor. 

BACKGROUND OF THE INVENTION 

Source-level languages like C and C++ typically do not 
support constructs that enable access to low-level machine- 
instructions. Yet many instruction set architectures provide 
functionally useful machine instructions that cannot readily 
be accessed from standard source-level constructs.

Typically, programmers, and notably operating system 
developers, access the functionality afforded by these spe- 
cial (possibly privileged) machine-instructions from source 
programs by invoking subroutines coded in assembly 
language, where the machine instructions can be directly 
specified. This approach suffers from a significant perfor- 
mance drawback in that the overhead of a procedure call/ 
return sequence must be incurred in order to execute the 
special machine instruction(s). Moreover, the assembly- 
coded machine instruction sequence cannot be optimized 
along with the invoking routine. 

To overcome the performance limitation with the assem- 
bly routine invocation strategy, compilers known in the art, 
such as the GNU C compiler ("gcc"), provide some rudimentary
high-level language extensions to allow programmers to
embed a restricted set of machine instructions directly into 
their source code. In fact, the 1990 American National 
Standard for Information Systems — Programming Lan- 
guage C (hereinafter referred to as the "ANSI Standard") 
recommends the "asm" keyword as a common extension 
(though not part of the standard) for embedding machine 
instructions into source code. The ANSI Standard specifies 
no details, however, with regard to how this keyword is to 
be used. 

Current schemes that employ this strategy have drawbacks.
For instance, gcc employs an arcane specification
syntax. Moreover, the gcc optimizer does not have an innate
knowledge of the semantics of embedded machine instruc- 
tions and so the user is required to spell out the optimization 
restrictions. No semantic checks are performed by the
compiler on the embedded instructions and for the most part 
they are simply "passed through" the compiler and written 
out to the target assembly file. 
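
As an illustration of the kind of specification syntax referred to above, the following sketch embeds an integer add using GCC-style extended asm. It is an assumption-laden example rather than part of this disclosure: the helper name add_with_inline_asm is hypothetical, the x86 instruction and constraint strings are chosen only for illustration, and a plain C fallback is compiled on other targets.

```c
/* Hypothetical helper (not from the disclosure): adds two ints using
 * GCC-style extended asm on x86, and plain C everywhere else. */
static int add_with_inline_asm(int a, int b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    int result;
    /* Layout: template : outputs : inputs : clobbers.
     * "=r" : result is written to a register picked by the compiler.
     * "0"  : a must start out in the same register as operand 0.
     * "r"  : b may live in any general-purpose register.
     * "cc" : the instruction modifies the condition codes. */
    __asm__ ("addl %2, %0"
             : "=r" (result)
             : "0" (a), "r" (b)
             : "cc");
    return result;
#else
    return a + b;   /* portable fallback for other compilers/targets */
#endif
}
```

Calling add_with_inline_asm(40, 2) yields 42 on either path. Note that the programmer, not the compiler, is responsible for stating the operand constraints and side effects correctly; the instruction template itself is passed through as text.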

Other drawbacks of the inline assembly support in current 
compilers include: 

(a) lack of functionality to allow the user to specify 
scheduling restrictions associated with embedded 
machine instructions. This functionality would be par- 
ticularly advantageous with respect to privileged 
instructions. 

(b) imposition of arbitrary restrictions on the kind of 
operands that may be specified for the embedded 
machine instructions, for example: 

the compiler may require operands to be simple pro- 
gram variables (where permitting an arbitrary arith- 
metic expression as an operand would be more 
advantageous); and 

the operands may be unable to refer to machine-specific 
resources in a syntactically natural manner. 






(c) lack of functionality to allow the programmer to 
access the full range and precision of internal floating- 
point register representations when embedding 
floating-point instructions. This functionality would 
simplify high-precision or high-performance floating- 
point algorithms. 

(d) imposition of restrictions on the ability to inline 
library procedures that include embedded machine 
instructions into contexts where such procedures are 
invoked, thereby curtailing program optimization 
effectiveness. 

In addition, when only a selected subset of the machine
opcodes are permitted to be embedded into user programs,
it may be cumbersome in current compilers to extend the
embedded assembly support for other machine opcodes. In
particular, this may require careful modifications to many
portions of the compiler source code. An extensible mechanism
capable of extending embedded assembly support to
other machine opcodes would reduce the number and complexity
of source code modifications required.

It would therefore be highly advantageous to develop a
compiler with a sophisticated capability for processing
machine instructions embedded in high level source code. A
"natural" specification syntax would be user friendly, while
independent front-end validation would reduce the potential 
for many programmer errors. Further, it would be advanta- 
geous to implement an extensible compiler mechanism that 
processes source code containing embedded machine 
instructions where the mechanism is smoothly receptive to 
programmer-defined parameters indicating the nature and 
extent of compiler optimization permitted in a given case. A 
particularly useful application of such an improved compiler 
would be in coding machine-dependent "library" functions 
which would otherwise need to be largely written in assem- 
bly language and would therefore not be subject to effective 
compiler optimization, such as inlining. 

In summary, there is a need for a compiler mechanism that 
allows machine instructions to be included in high-level 
program source code, where the translation and compiler 
optimization of such instructions offers the following advan- 
tageous features to overcome the above-described shortcom- 
ings of the current art: 

a) a "natural" specification syntax for embedding low- 
level hardware machine instructions into high-level 
computer program source code. 

b) a mechanism for the compiler front-end to perform 
syntax and semantic checks on the constructs used to 
embed machine instructions into program source code 
in an extensible and uniform manner, that is indepen- 
dent of the specific embedded machine instructions. 

c) an extensible mechanism that minimizes the changes 
required in the compiler to support additional machine 
instructions. 

d) a mechanism for the programmer to indicate the degree 
of instruction scheduling freedom that may be assumed 
by the compiler when optimizing high-level programs 
containing certain types of embedded machine instruc- 
tions. 

e) a mechanism to "inline" library functions containing 
embedded machine instructions into programs that 
invoke such library functions, in order to improve the 
run-time performance of such library function 
invocations, thereby optimizing overall program 
execution performance. 

Such features would gain yet further advantage and utility 
in an environment where inline assembly support could gain 
access to the full width of the floating point registers in the
target processor via specification of a corresponding data
type in source code.

SUMMARY OF THE INVENTION

These and other objects and features are achieved by one
embodiment of the present invention which comprises the
following:

1. A general syntax for embedding or "inlining" machine
(assembly) instructions into source code. For each machine
instruction that is a candidate for source-level inlining, an
"intrinsic" (built-in subroutine) is defined. A function
prototype is specified for each such intrinsic with enumerated
data types used for instruction completers. The function
prototype is of the following general form:

opcode_result = _Asm_opcode(opcode_argument_list[, serialization_constraint_specifier])

where _Asm_opcode is the name of the intrinsic function
(with the "opcode" portion of the name replaced with the
opcode mnemonic). Opcode completers, immediate source
operands, and register source operands are specified as
arguments to the intrinsic and the register target operand (if
applicable) corresponds to the "return value" of the intrinsic.

The data types for register operands are defined to match
the requirements of the machine instruction, with the
compiler performing the necessary data type conversions on the
source arguments and the return value of the "inline-assembly"
intrinsics, in much the same way as for any
user-defined prototyped function.

Thus, the specification syntax for embedding machine
instructions in source code is quite "natural" in that it is very
similar to the syntax used for an ordinary function call in
most high-level languages (e.g. C, C++) and is subject to
data type conversion rules applicable to ordinary function
calls.

Further, the list of arguments for the machine opcode is
followed by an optional instruction
serialization_constraint_specifier. This feature provides the programmer
a mechanism to restrict, through a parameter specified in
source code, compiler optimization phases from re-ordering
instructions across an embedded machine instruction.

This feature is highly advantageous in situations where
embedded machine instructions may have implicit side-effects,
needing to be honored as scheduling constraints by
the compiler only in certain contexts known to the user. This
ability to control optimizations is particularly useful for
operating system programmers who have a need to embed
privileged low-level "system" instructions into their source
code.

Serialization_constraint_specifiers are predefined into
several disparate categories. In application, the
serialization_constraint_specifier associated with an
embedded machine instruction is encoded as a bit-mask that
specifies whether distinct categories of instructions may be
re-ordered relative to the embedded machine instruction to
dynamically execute either before or after that embedded
machine instruction. The serialization constraint specifier is
specified as an optional final argument to selected inline
assembly intrinsics for which user-specified optimization
control is desired. When this argument is omitted, a suitably
conservative default value is assumed by the compiler.

2. A mechanism to translate the source-level inline assembly
intrinsics from the source-code into a representation
understood by the compiler back-end in a manner that is
independent of the specific characteristics of the machine
instruction being inlined.

The inline assembly intrinsics are "lowered" by the compiler
front end into a built-in function-call understood by the
code generation phase of the compiler. The code generator
in turn expands the intrinsic into the corresponding machine
instruction which is then subjected to low-level optimization.

An automated table-driven approach is used to facilitate
both syntax and semantic checking of the inline assembly
intrinsics as well as the translation of the intrinsics into
actual machine instructions. The table contains one entry for
each intrinsic, with the entry describing characteristics of
that intrinsic, such as its name and the name and data types
of the required opcode arguments and return value (if any),
as well as other information relevant to translating the
intrinsic into a low-level machine instruction.

The table is used to generate (1) a file that documents the
intrinsics for user programmers (including their function
prototypes) (2) a set of routines invoked by the compiler
front-end to parse the supported inline assembly intrinsics
and (3) a portion of the compiler back-end that translates the
built-in function-call corresponding to each intrinsic into the
appropriate machine instruction.

This table-driven approach requires very few, if any,
changes to the compiler when extending source-level inline
assembly capabilities to support the embedding of additional
machine instructions. It is usually sufficient just to add a
description of the new machine instructions to the table,
re-generate the derived files, and re-build the compiler, so
long as the low-level components of the compiler support
the emission of the new machine instructions.

3. Where supported by a cross-module compiler optimization
framework, a mechanism to capture the intermediate
representation into a persistent format enables cross-module
optimization of source-code containing embedded machine
instructions. In particular, library routines with embedded
machine instructions can themselves be "inlined" into the
calling user functions, enabling more effective, context-sensitive
optimization of such library routines, resulting in
improved run-time performance of applications that invoke
the library routines. This feature is highly advantageous, for
instance, in the case of math library routines that typically
need to manipulate aspects of the floating-point run-time
environment through special machine instructions.

The inventive machine instruction inlining mechanism is
also advantageously used in conjunction with a new data
type which enables programmatic access to the widest mode
floating-point arithmetic supported by the processor. As
noted in the previous section, inline support in current
compilers is generally unable to access the full range and
precision of internal floating-point register representations
when embedding floating-point instructions. Compiler
implementations typically map source-level floating-point
data types to fixed-width memory representations. The
memory width determines the range and degree of precision
to which real numbers can be represented. So, for example,
an 8-byte floating-point value can represent a larger range of
real numbers with greater precision than a corresponding
4-byte floating-point value. On some processors, however,
floating-point registers may have a larger width than the
natural width of source-level floating-point data types,
allowing for intermediate floating-point results to be computed
to a greater precision and numeric range; but this
extended precision and range is not usually available to the
user of a high-level language (such as C or C++).

In order to provide access to the full width of the floating
point registers for either ordinary floating-point arithmetic or
for inline assembly constructs involving floating-point
operands, therefore, a new built-in data type is also disclosed
herein, named "_fpreg" for the C programming language,
corresponding to a floating point representation that is as
wide as the floating-point registers of the underlying
processor. Users may take advantage of the new data type in
conjunction with the disclosed methods for the embedding
of a machine instruction by using this data type for the
parameters and/or return value of an intrinsic that maps to a
floating-point machine instruction.

It is therefore a technical advantage of the present invention
to enable a flexible, easy to understand language-compatible
syntax for embedding or inlining machine
instructions into source code.

It is a further technical advantage of the present invention
to enable the compiler to perform semantic checks on the
embedded machine instructions that are specified by invoking
prototyped intrinsic routines, in much the same way that
semantic checks are performed on calls to prototyped user
routines.

It is a still further technical advantage of the present
invention to enable inline assembly support to be extended
to new machine instructions in a streamlined manner that
greatly minimizes the need to modify compiler source code.

It is a yet further technical advantage of the present
invention to enable user-controlled optimization of embedded
low-level "system" machine instructions.

It is another technical advantage of the present invention
to enable, where supported, cross-module optimizations,
notably inlining, of library routines that contain embedded
machine instructions.

Another technical advantage of the present invention is,
when used in conjunction with the new _fpreg data type
also disclosed herein, to support and facilitate reference to
floating-point machine instructions in order to provide
access to the full width of floating-point registers provided
by a processor.

The foregoing has outlined rather broadly the features and
technical advantages of the present invention in order that
the detailed description of the invention that follows may be
better understood. Additional features and advantages of the
invention will be described hereinafter which form the
subject of the claims of the invention. It should be appreciated
by those skilled in the art that the conception and the
specific embodiment disclosed may be readily utilized as a
basis for modifying or designing other structures for carrying
out the same purposes as the present invention. It should
also be realized by those skilled in the art that such equivalent
constructions do not depart from the spirit and scope of
the invention as set forth in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present
invention, and the advantages thereof, reference is now
made to the following descriptions taken in conjunction with
the accompanying drawings, in which:

FIG. 1 illustrates a typical compiler system in which the
present invention may be enabled;

FIG. 2 illustrates the availability of an inline assembly
intrinsic header file ("inline.h") SH2 as a system header file
containing intrinsics through which machine instructions
may be inlined in accordance with the present invention;

FIG. 3 illustrates use of inline assembly descriptor table
301 to generate the "inline.h" system header file;

FIG. 4 illustrates use of inline assembly descriptor table
301 to generate a library of low-level object files available
to assist translation of intrinsics from a high level intermediate
representation of code to a low level intermediate
representation thereof;

FIG. 5 illustrates use of inline assembly descriptor table
301 to create a library to assist front-end validation of
intrinsics during the compilation thereof;

FIG. 6 illustrates the language independent nature of file
and library generation in accordance with the present invention;
and

FIG. 7 illustrates an application of the present invention,
where cross-module optimization and [illegible in source] is supported,
to enable performance critical library routines (such as math
functions) to access low-level machine instructions and still
allow such routines to be inlined into user applications.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Turning first to FIG. 1, a typical compilation process is
illustrated upon which a preferred embodiment of the
present invention may be enabled. The compiler transforms
the program on which it operates through several
representations. Source code 101 is a representation created
by the programmer. Front-end processing 102 transforms
source code 101 into a high-level intermediate representation
103 used within the compiler. At this high-level intermediate
stage, the associated high-level intermediate representation
103 is advantageously (although not mandatorily)
passed through high-level optimizer 104 to translate the
representation into a more efficient high-level intermediate
representation. Code generator 105 then translates high-level
intermediate representation 103 into low-level intermediate
representation 106, whereupon a low-level optimizer
107 translates low-level intermediate representation
106 into an efficient object code representation 108 that can
be linked into a machine-executable program. It will be
appreciated that some compilers are known in the art in
which front-end and code generation stages 102 and 105 are
combined (removing the need for high-level intermediate
representation 103), and in others, the code generation and
low-level optimizing stages 105 and 107 are combined
(removing the need for low-level intermediate representation
106). Other compilers are known that add additional
representations and translation stages within this basic
framework, however, the present invention may be enabled
on any analogous compiler performing substantially the
steps described on FIG. 1.

With reference now to FIG. 2, the source level specification
features of the present invention will now be discussed.
It will be appreciated that consistent with earlier disclosure
in the background and summary sections set forth above, the
present invention provides a mechanism by which machine
instructions may be embedded into source code to enable
improved levels of smooth and efficient compilation thereof,
advantageously responsive to selected optimization restrictions
ordained by the programmer.

In a preferred embodiment herein, the invention is
enabled on compilation of C source code. It will be
appreciated, however, that use of the C programming language
in this way is exemplary only, and that the invention
may be enabled analogously on other high-level programming
languages, such as C++ and FORTRAN, without
departing from the spirit and scope of the invention.

Turning now to FIG. 2, it should be first noted that blocks
202, 203 and 204 are explanatory labels and not part of the
overall flow of information illustrated thereon. On FIG. 2,
machine instructions are embedded or "inlined" into C
source code 201S by the user, through the use of "built-in"
pseudo-function (or "intrinsic") calls. Generally, programs
that make intrinsic calls may refer to and incorporate several
types of files into source code 201S, such as application
header files AH1-AHn, or system header files SH1-SHn. In
the exemplary use of C source code described herein, the
present invention is enabled through inclusion of system
header file SH2 on FIG. 2, namely an inline assembly
intrinsic header file (see label block 202). The file is specified
as set forth below in detail so that, when included in
source code 201S, the mechanism of the present invention
will be enabled.

Note that as shown on FIG. 2, the inline assembly intrinsic
header file SH2 (named "inline.h" on FIG. 2 to be consistent
with the convention on many UNIX®-based systems) typically
contains declarations of symbolic constants and intrinsic
function prototypes (possibly just as comments), and
typically resides in a central location along with other
system header files SH1-SHn. The "inline.h" system header
file SH2 may then be "included" by user programs that
embed machine instructions into source code, enabling the
use of _Asm_opcode intrinsics.

With reference to FIG. 3, it will be seen that a software
tool 302 advantageously generates "inline.h" system header
file SH2 from an inline assembly descriptor table 301 at the
time the compiler product is created. This table-driven
approach is described in more detail later.
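
The idea of deriving the header file from a descriptor table can be sketched as follows. This is only an illustration under stated assumptions: the IntrinsicDesc structure, its field names, and the emit_header function are hypothetical and are not the layout of table 301 or the behavior of tool 302; the two entries reuse the ADD and LOAD prototypes that this description gives as examples further on.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical descriptor-table entry; field names are assumptions. */
typedef struct {
    const char *mnemonic;     /* opcode mnemonic, e.g. "ADD"            */
    const char *return_type;  /* data type of the register target       */
    const char *params;       /* completers first, then operands        */
} IntrinsicDesc;

static const IntrinsicDesc intrinsic_table[] = {
    { "ADD",  "UInt64", "UInt64 r2, UInt64 r3" },
    { "LOAD", "UInt32", "_Asm_size size, void *mem_addr" },
};

/* Writes one prototype line per table entry into buf. Returns the
 * number of characters stored (excluding the terminating NUL);
 * output is truncated if buf is too small. */
static size_t emit_header(char *buf, size_t cap)
{
    size_t used = 0;
    size_t n = sizeof intrinsic_table / sizeof intrinsic_table[0];

    if (cap == 0)
        return 0;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + used, cap - used, "%s _Asm_%s (%s);\n",
                         intrinsic_table[i].return_type,
                         intrinsic_table[i].mnemonic,
                         intrinsic_table[i].params);
        if (w < 0)
            break;                      /* encoding error: stop      */
        if ((size_t)w >= cap - used) {  /* line did not fit: stop    */
            used = cap - 1;
            break;
        }
        used += (size_t)w;
    }
    return used;
}
```

Under this sketch, supporting a further instruction would amount to adding one row to intrinsic_table and regenerating the output, which mirrors the extensibility argument made for the table-driven approach.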

Returning to FIG. 2, when an inline assembly intrinsic
header file specified in accordance with the present invention
is included in a user program, processing continues to
convert source code 201S into object code 201O in the
manner illustrated on FIG. 1.

Syntax 

The general syntax for the pseudo-function call/intrinsic 
call in C is as follows: 

opcode_result = _Asm_opcode (<completer_list>, <operand_list>
    [, <serialization_constraint>]);

A unique built-in function name (denoted above as
_Asm_opcode) is defined for each assembly instruction
that can be generated using the inline assembly mechanism.
The inline assembly instructions may be regarded as external
functions for which implicit external declarations and
function definitions are provided by the compiler. These
intrinsics are not declared explicitly by the user programs.
Moreover, the addresses of the inline assembly intrinsics
may not be computed by a user program.

The "_Asm_opcode" name is recognized by the compiler
and causes the corresponding assembly instruction to
be generated. In general, a unique _Asm_opcode name is
defined for each instruction that corresponds to a supported
inline assembly instruction.

The first arguments to the _Asm_opcode intrinsic call,
denoted as <completer_list> in the general syntax
description, are symbolic constants for all the completers
associated with the opcode. The inline assembly opcode
completer arguments are followed by the instruction
operands, denoted as <operand_list> in the general syntax
description, which are specified in order according to
protocols defined by particular instruction set architectures.
Note that if an embedded machine instruction has no
completers, then the <completer_list> and its following
comma are omitted. Similarly, if the embedded machine
instruction has no operands, then the <operand_list> and its
preceding comma are omitted.
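
The argument ordering just described (completers first, then operands) can be modeled in ordinary C. The sketch below is not the compiler's implementation: model_asm_add, model_asm_load, and the model_size constants are hypothetical software stand-ins, loosely patterned on the ADD and LOAD instructions used as examples later in this description, and exist only to show the shape of the calls.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical completer constants for the size of a loaded object. */
typedef enum { SIZE_B = 1, SIZE_HW = 2, SIZE_W = 3 } model_size;

/* An opcode with no completers: the argument list is just the
 * operand list, and the target register is the return value. */
static uint64_t model_asm_add(uint64_t r2, uint64_t r3)
{
    return r2 + r3;
}

/* An opcode with one completer followed by one operand. The loaded
 * object is zero-extended into a 32-bit result. */
static uint32_t model_asm_load(model_size size, const void *mem_addr)
{
    uint8_t  b  = 0;
    uint16_t hw = 0;
    uint32_t w  = 0;

    switch (size) {
    case SIZE_B:  memcpy(&b,  mem_addr, sizeof b);  return b;
    case SIZE_HW: memcpy(&hw, mem_addr, sizeof hw); return hw;
    case SIZE_W:  memcpy(&w,  mem_addr, sizeof w);  return w;
    }
    return 0;   /* unknown completer value */
}
```

A call such as model_asm_load(SIZE_W, &cell) reads like any other prototyped function call, which is the sense in which the syntax is described as "natural".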

An operand that corresponds to a dedicated machine
register (source or target) may be specified using an appropriate
symbolic constant. Such symbolic constants are typically
defined in the "inline.h" system header file discussed
above with reference to FIG. 2. An immediate operand is
specified as a simple integer constant and generally should
be no larger than the corresponding instruction field width.
To be compatible, a source language expression having a
scalar data type must be specified as the argument corresponding
to a general purpose or floating-point register
source operand. In particular, a general purpose register
source operand must be specified using an argument expression
that has an arithmetic data type (i.e. integral or floating-point
type). An operand may nonetheless be of any type
within this requirement for compatibility. For example, an
operand may be a simple variable, or alternatively it may be
an arithmetic expression including variables.

Typically, the compiler will convert the argument value
for a general-purpose register source operand into an
unsigned integer value as wide as the register width of the
target machine (e.g. 32-bits or 64-bits).

Where a general-purpose register operand clearly corresponds
to a memory address, an argument value having a
pointer type may be required.

Any general purpose or floating-point register target operand
value defined by an inline assembly instruction is treated
as the return value of the _Asm_opcode pseudo-function
call.

A general-purpose register target operand value is typi- 
cally treated as an unsigned integer that is as wide as the 
register-width of the target architecture (e.g. 32-bits or 
64-bits). Therefore, the pseudo-function return value will be 
subject to the normal type conversions associated with an 
ordinary call to a function that returns an unsigned integer 
value. 
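
The conversions described above are the ordinary C rules for prototyped calls, and can be observed with a plain function. In this sketch, model_add is a hypothetical stand-in for an intrinsic with 64-bit unsigned register operands; it is not the intrinsic itself.

```c
#include <stdint.h>

typedef uint64_t UInt64;

/* Hypothetical stand-in for a prototyped intrinsic with two 64-bit
 * general-purpose register source operands. */
static UInt64 model_add(UInt64 r2, UInt64 r3)
{
    return r2 + r3;     /* unsigned arithmetic wraps modulo 2^64 */
}

static int convert_demo(void)
{
    int g2 = -1;
    int g3 = 3;

    /* Arguments: each int is converted to UInt64 by the prototype,
     * so -1 becomes 2^64 - 1; the second operand is an arithmetic
     * expression rather than a simple variable.
     * Return value: the UInt64 result (here 5, after wrap-around)
     * is converted back to int by the assignment. */
    int g1 = model_add(g2, g3 * 2);
    return g1;
}
```

Here convert_demo() returns 5: the sum (2^64 - 1) + 6 wraps to 5 in 64-bit unsigned arithmetic, and 5 is representable as an int.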

To avoid potential loss of precision when operating on
floating-point values, however, floating-point register target
and source operands of embedded machine instructions in a
preferred embodiment are allowed to be as wide as the
floating-point register-width of the target architecture. For
architectures where the floating-point register-width exceeds
the natural memory-width of standard floating-point data
types (e.g. the "float" and "double" standard data types in the
C language), the new "_fpreg" data type may be used to
declare the arguments and return value of inline assembly
intrinsics used to embed floating-point instructions into
source code. As explained in more detail elsewhere in this
disclosure, the "_fpreg" data type corresponds to a floating-point
representation that is as wide as the floating-point
registers of the target architecture.
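
The precision argument can be illustrated in standard C by using long double as a stand-in for a wider-than-double representation. This is an assumption made purely for illustration: long double is not the "_fpreg" type, and on some platforms it is no wider than double, in which case both functions below report the same result.

```c
#include <float.h>

/* Returns 1 if adding 2^-60 to 1.0 changes the value when the sum is
 * stored in a double (53-bit significand on IEEE 754 systems). */
static int visible_in_double(void)
{
    volatile double one = 1.0, tiny = 0x1p-60;
    volatile double sum = one + tiny;   /* store forces rounding to double */
    return sum != 1.0;
}

/* Same experiment with long double, which has a 64-bit significand on
 * x87 and 113 bits where it is IEEE binary128. */
static int visible_in_long_double(void)
{
    volatile long double one = 1.0L, tiny = 0x1p-60L;
    volatile long double sum = one + tiny;
    return sum != 1.0L;
}
```

On a target where LDBL_MANT_DIG is at least 61, the small addend survives in the long double sum but is rounded away in the double sum, which is the kind of intermediate precision that a register-width data type is meant to expose.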

The following examples illustrate the source code speci- 
fication technique of the present invention as described 
immediately above: 

i) For an "ADD" machine instruction of the form:

ADD r1=r2, r3

where r1, r2, and r3 correspond to 64-bit general-purpose
machine registers, the function prototype for the inline
intrinsic can be defined as follows:

UInt64 _Asm_ADD (UInt64 r2, UInt64 r3)

where "UInt64" corresponds to a 64-bit unsigned integer
data-type.

The ADD machine instruction can then be embedded into
a "C" source program as follows:

#include <inline.h>

int g1, g2, g3; /* global integer variables */

main ( )
{
    g1 = _Asm_ADD (g2, g3);
}

ii) For a "LOAD" machine instruction of the form:

LOAD.<size> value = [mem_addr]

where

<size> is an opcode completer encoding the bit-size of
the object being loaded which may be one of "b" (for
byte or 8-bits), "hw" (for half-word or 16-bits) or
"word" (for word or 32-bits)

"value" corresponds to a 32-bit general-purpose
machine register whose value is to be set by the load
instruction

"mem_addr" corresponds to a 32-bit memory address
that specifies the starting location in memory of the
object whose value is to be loaded into "value"

the function prototype for the inline intrinsic can be
defined as follows:

UInt32 _Asm_LOAD (_Asm_size size, void * mem_addr)

where "UInt32" corresponds to a 32-bit unsigned integer
data-type, "void *" is a generic pointer data type, and
"_Asm_size" is an enumeration type that encodes one
of 3 possible symbolic constants. For example, in the C
language, _Asm_size may be defined as follows:

typedef enum {
    _b = 1,
    _hw = 2,
    _w = 3
} _Asm_size;

Alternatively, _Asm_size may be defined to be a simple
integer data type with pre-defined symbolic constant values
for each legal LOAD opcode completer, using language
neutral "C" pre-processor directives:

#define _b (1)
#define _hw (2)
#define _w (3)

Note that the declarations associated with "_Asm_size"
would be placed in the "inline.h" system header file, and
would be read in by the compiler when parsing the source
program. The LOAD machine instruction can then be
embedded into a "C" program thusly:

#include <inline.h>

int g; /* global integer variable */
int *p; /* global integer pointer variable */

main ( )
{
    g = _Asm_LOAD (_w, p);
}

in either direction in a dynamic sense (i.e. before to after, or
after to before) in the current function body. If omitted, the
compiler will use a default serialization mask value. For the
purposes of specifying serialization constraints in a preferred
embodiment, the instruction opcodes may
advantageously, but not mandatorily, be divided into the
following categories:

1. Memory Opcodes: load and store instructions
2. ALU Opcodes: instructions with general-purpose register
operands
3. Floating-Point Opcodes: instructions with floating-point
register operands
4. System Opcodes: privileged "system" instructions
5. Branch: [illegible in source]
6. Call: function [remainder illegible in source]

With respect to serialization constraints, an embedded
machine instruction may act as a "fence" that prevents the
scheduling of downstream instructions ahead of it, or a
"fence" that prevents the scheduling of upstream instructions
after it. Such constraints may be referred to as a
"downward fence" and "upward fence" serialization
constraint, respectively. Given this classification, the serialization
constraints associated with an inline system opcode
can be encoded as an integer value, which can be defined by
ORing together an appropriate set of constant bit-masks. For
a system opcode, this encoded serialization constraint value
may be specified as an optional final argument of the
_Asm_opcode intrinsic call. For example, for the C
language, the bit-mask values may be defined to be enumeration
constants as follows:


Certain inline assembly opcodes, notably those that may 
be considered as privileged "system" opcodes, may option- 
ally specify an additional argument that explicitly indicates 
the constraints that the compiler must honor with regard to 
instruction re-ordering. This optional "serialization con- 
straint" argument is specified as an integer mask value. The 
integer mask value encodes what types of (data independent) 
instructions may be moved past the inline assembly opcode 



35 



typedef enum {
    _NO_FENCE        = 0x0,
    _UP_MEM_FENCE    = 0x1,
    _UP_ALU_FENCE    = 0x2,
    _UP_FLOP_FENCE   = 0x4,
    _UP_SYS_FENCE    = 0x8,
    _UP_CALL_FENCE   = 0x10,
    _UP_BR_FENCE     = 0x20,
    _DOWN_MEM_FENCE  = 0x100,
    _DOWN_ALU_FENCE  = 0x200,
    _DOWN_FLOP_FENCE = 0x400,
    _DOWN_SYS_FENCE  = 0x800,
    _DOWN_CALL_FENCE = 0x1000,
    _DOWN_BR_FENCE   = 0x2000
} _Asm_fence;

(Note: The _Asm_fence definition would advantageously be placed in the
"inline.h" system header file.)

So, for example, to prevent the compiler from scheduling
floating-point operations across an inlined system opcode
that changes the default floating-point rounding mode, a
programmer might use an integer mask formed as
(_UP_FLOP_FENCE | _DOWN_FLOP_FENCE).
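
As a minimal sketch of how such masks compose, the C fragment below re-declares the fence constants under hypothetical names without the leading underscore (in user code, identifiers of that form are reserved for the implementation) and adds a hypothetical helper, blocks_motion, that is not part of the disclosure. It models only the bit arithmetic and has no effect on any real compiler's scheduling.

```c
/* Illustrative model of the serialization bit-masks. The names mirror the
   _Asm_fence constants but drop the leading underscore; that renaming is
   an assumption made for this sketch only. */
typedef enum {
    NO_FENCE        = 0x0,
    UP_MEM_FENCE    = 0x1,
    UP_ALU_FENCE    = 0x2,
    UP_FLOP_FENCE   = 0x4,
    UP_SYS_FENCE    = 0x8,
    UP_CALL_FENCE   = 0x10,
    UP_BR_FENCE     = 0x20,
    DOWN_MEM_FENCE  = 0x100,
    DOWN_ALU_FENCE  = 0x200,
    DOWN_FLOP_FENCE = 0x400,
    DOWN_SYS_FENCE  = 0x800,
    DOWN_CALL_FENCE = 0x1000,
    DOWN_BR_FENCE   = 0x2000
} asm_fence;

/* Mask for an opcode that changes the rounding mode: no floating-point
   operation may cross it in either direction. */
static unsigned rounding_mode_mask(void)
{
    return UP_FLOP_FENCE | DOWN_FLOP_FENCE;
}

/* Hypothetical helper: nonzero if 'mask' forbids moving an instruction of
   the category and direction selected by 'fence_bit' across the opcode. */
static int blocks_motion(unsigned mask, asm_fence fence_bit)
{
    return (mask & (unsigned)fence_bit) != 0;
}
```

With these values, rounding_mode_mask() evaluates to 0x404, so floating-point operations are fenced in both directions while memory and ALU instructions remain free to move.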

The _UP_BR_FENCE and _DOWN_BR_FENCE
relate to "basic block" boundaries. (A basic block
corresponds to the largest contiguous section of source code
without any incoming or outgoing control transfers,
excluding function calls.) Thus, a serialization constraint value
formed by ORing together these two bit masks will prevent
the compiler from scheduling the associated inlined system
opcode outside of its original basic block.

Note that the compiler must automatically detect and
honor any explicit data dependence constraints involving an
inlined system opcode, independent of its associated
serialization mask value. So, for example, when an argument
of an inlined system opcode intrinsic call is defined by an
integer add operation, it is not necessary to explicitly specify
the _UP_ALU_FENCE bit-mask as part of the
serialization constraint argument.
The serialization constraint integer mask value may be
treated as an optional final argument to the inline system
opcode intrinsic invocation. If this argument is omitted, the
compiler may choose to use any reasonable default
serialization mask value (e.g. 0x3D3D, which gives full
serialization with all other opcode categories except ALU
operations). Note that if a system opcode instruction is
constrained to be serialized with respect to another
instruction, the compiler must not schedule the two
instructions to execute concurrently.

To specify serialization constraints at an arbitrary point in
a program, a placeholder inline assembly opcode intrinsic
named _Asm_sched_fence may be used. This special
intrinsic just accepts one argument that specifies the
serialization mask value. The compiler will then honor the
serialization constraints associated with this placeholder
opcode, but omit the opcode from the final instruction
stream.

The scope of the serialization constraints is limited to the
function containing the inlined system opcode. By default,
the compiler may assume that called functions do not
specify any inlined system opcodes with serialization
constraints. However, the _Asm_sched_fence intrinsic may
be used to explicitly communicate serialization constraints
at a call-site that is known to invoke a function that executes
a serializing system instruction.

Example:

If a flush cache instruction ("FC" opcode) is a privileged
machine instruction that is to be embedded into source code
and one that should allow user-specified serialization
constraints, the following inline assembly intrinsic may be
defined:

void _Asm_FC ([serialization_constraint_specifier])

where the return type of the intrinsic is declared to be "void"
to indicate that no data value is defined by the machine
instruction.

Now the FC instruction may be embedded in a C
program with serialization constraints that prevent the
compiler from re-ordering memory instructions across
the FC instruction as shown below:

#include <inline.h>

int g1, g2; /* global integer variables */

main ( )
{
    g1 = 0; /* can't be moved after FC instruction */
    _Asm_FC (_UP_MEM_FENCE | _DOWN_MEM_FENCE);
    g2 = 3; /* can't be moved before FC instruction */
}

Note that the _Asm_FC instruction specifies memory
fence serialization constraints in both directions, preventing
the re-ordering of the stores to global variables g1 and g2
across the FC instruction.

Use of Table-driven Approach

A table-driven approach is advantageously used to help
the compiler handle assembly intrinsic operations. The table
contains one entry for each intrinsic, with the entry
describing the characteristics of that intrinsic. In a preferred
embodiment, although not mandatorily, those characteristics
may be tabulated as follows:

(a) The name of the intrinsic
(b) A brief textual description of the intrinsic
(c) Names and types of the intrinsic arguments (if any)
(d) Name and type of the intrinsic return value (if any)
(e) With momentary reference back to FIG. 1, additional
information for code generator 105 to perform the
translation from high level intermediate representation 103 to
low level intermediate representation 106

It will be appreciated that this table-driven approach
enables the separation of the generation of the assembly
intrinsic header file, parsing support library, and code
generation utilities from the compiler's mainstream compilation
processing. Any maintenance to the table may be made (such
as adding to the list of supported inlined instructions) without
affecting the compiler's primary processing functionality.
This makes performing such maintenance easy and
predictable. The table-driven approach is also user programming
language independent, extending the versatility of the
present invention.

On a more detailed level, at least three specific advantages
are offered by this table-driven approach:

1. Header File Generation

The table facilitates generation of a file that documents
intrinsics for user programmers, providing intrinsic function
prototypes and brief descriptions. Using table elements (a),
(b), (c) and (d) as itemized above, and with reference again
to the preceding discussion accompanying FIG. 3, a
software tool 302 generates an "inline.h" system header SH2
from inline assembly descriptor table 301. Furthermore,
"inline.h" system header SH2 also defines and contains an
enumerated set of symbolic constants, registers, completers,
and so forth, that the programmer may use as legal operands
to inline assembly intrinsic calls in the current program.
Further, in cases where an operand is a numeric constant,
"inline.h" system header SH2 documents the range of legal
values for the operand, which is checked by the compiler.

2. Parsing Library Generation

The table facilitates generation of part of a library that
assists, with reference again now to FIG. 1, front end
processing ("FE") 102 in recognizing intrinsics specified by
the programmer in source code 101, validating first that the
programmer has written such intrinsics legally, and then
translating the intrinsics into high-level intermediate
representation 103.

Note that in accordance with the present invention, it
would also be possible to generate intrinsic-related front-end
processing directly. In a preferred embodiment, however,
library functionality is used.

Table-driven front-end processing enables an
advantageous feature of the present invention, namely the automatic
syntax parsing and semantics checking of the user's inline
assembly code by FE 102. This feature validates that code
containing embedded machine instructions is semantically
correct when it is incorporated into source code 101 in the
same way that a front end verifies that an ordinary function
invocation is semantically correct. This frees other
processing units of the compiler, such as code generator 105 and
low level optimizer 107, from the time-consuming task of
error checking.

This front-end validation through reference to a partial
library is enabled by generation of a header file as illustrated
on FIGS. 5 and 6. Turning first to FIG. 5, in which it should
again be noted that blocks 506, 507 and 508 are explanatory
items and not part of the information flow, inline assembly
descriptor table 301 provides elements (a), (c) and (d) as
itemized above to software tool 501. This information
enables software tool 501 to generate language-independent
inline assembly parser header file ("asmp.h"), which may
then be included into corresponding source code "asmp.c"
503 and compiled 504 into corresponding object code
"asmp.o" 505. It will thus be seen from FIG. 5 that "asmp.o"
505 is a language-independent inline assembly parser library
object file in a form suitable for assisting FE 102 on FIG. 1.
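
The descriptor table and the header generation of item 1 above can be sketched in C as follows. This is an illustrative model only: the structure layout, the names intrinsic_desc, find_intrinsic and emit_prototype, and the two sample rows are assumptions made for this sketch and do not reproduce the actual table format or software tool of the preferred embodiment. Element (e) is left out.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ARGS 4

/* One row of a hypothetical inline assembly descriptor table,
   holding elements (a) through (d). */
typedef struct {
    const char *name;                 /* (a) intrinsic name            */
    const char *description;          /* (b) brief textual description */
    int         argc;                 /* (c) number of arguments       */
    const char *arg_types[MAX_ARGS];  /* (c) argument types            */
    const char *arg_names[MAX_ARGS];  /* (c) argument names            */
    const char *return_type;          /* (d) return type, or "void"    */
} intrinsic_desc;

static const intrinsic_desc table[] = {
    { "_Asm_ADD", "64-bit integer add", 2,
      { "UInt64", "UInt64" }, { "r2", "r3" }, "UInt64" },
    { "_Asm_LOAD", "sized load from memory", 2,
      { "_Asm_size", "void *" }, { "size", "mem_addr" }, "UInt32" },
};

/* Look an intrinsic up by name; returns NULL if it is not supported. */
static const intrinsic_desc *find_intrinsic(const char *name)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Write the prototype for one entry into buf, as a header-generation
   tool might. Returns 0 on success, -1 if buf is too small. */
static int emit_prototype(const intrinsic_desc *d, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s %s (", d->return_type, d->name);
    if (n < 0 || (size_t)n >= len)
        return -1;
    size_t used = (size_t)n;
    for (int i = 0; i < d->argc; i++) {
        n = snprintf(buf + used, len - used, "%s%s %s",
                     i ? ", " : "", d->arg_types[i], d->arg_names[i]);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    n = snprintf(buf + used, len - used, ");");
    if (n < 0 || (size_t)n >= len - used)
        return -1;
    return 0;
}
```

Adding a new intrinsic in this model means adding one row to the table; the lookup and the prototype emitter stay unchanged, which is the maintenance property the table-driven approach is meant to provide.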

With reference now to FIG. 6, it will be seen that
"inline.h" system header SH2 provides legal intrinsics for a
programmer to invoke from source code 601. On FIG. 6,
exemplary illustration is made of C source code 601c, C++
source code 601p, and FORTRAN source code 601f,
although the invention is not limited to these particular
programming languages, and will be understood to be also
enabled according to FIG. 6 on other programming
languages. It will be noted that each of the illustrated source
codes 601c, 601p and 601f have compiler operations and
sequences 601-608 analogous to FIG. 1. Further, "asmp.o"
library object file 505, being language independent, is
universally available to C FE 602c, C++ FE 602p and
FORTRAN FE 602f to assist in front-end error checking. Front
end processing FE 602 does this checking by invoking
utility functions defined in "asmp.o" library object file 505
to ensure that embedded machine instructions encountered
in source code 601 are using the correct types and numbers
of values. This checking is advantageously performed
before actual code for embedded machine instructions is
generated in high-level intermediate representation 603.

In this way, it will be appreciated that various potential
errors may be checked in a flexible, table-driven manner that
is easily maintained by a programmer. For example, errors
that may be checked include:

whether the instruction being inlined is supported.

whether the number of arguments passed is correct.

whether the arguments passed are of the correct type.

whether the values of numeric integer constant
arguments, if any, are within the allowable range.

whether the serialization constraint specifier is allowed
for the specified instruction.

Furthermore, the table also allows the system to compute
the default serialization mask for the specified instruction if
one is needed but not supplied by the user.
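
The checks listed above can be sketched as a single table-driven routine. The C fragment below is a simplified model under stated assumptions: the types intrinsic_rule and call_arg, the error codes, and the functions check_call and effective_mask are hypothetical names introduced here, and a real front end would work on its own representation of the parsed call.

```c
#include <stddef.h>

typedef enum { T_UINT32, T_UINT64, T_PTR, T_SIZE_COMPLETER } arg_type;

typedef enum {
    CHECK_OK, ERR_UNSUPPORTED, ERR_ARG_COUNT, ERR_ARG_TYPE,
    ERR_RANGE, ERR_FENCE_NOT_ALLOWED
} check_result;

/* One actual argument at a call site. */
typedef struct {
    arg_type type;
    int      is_constant;  /* nonzero for integer-constant operands */
    long     value;        /* meaningful only when is_constant      */
} call_arg;

/* Per-intrinsic checking data derived from the descriptor table. */
typedef struct {
    const char *name;
    int         argc;
    arg_type    types[2];
    long        min[2], max[2];  /* legal range for constant operands */
    int         allows_fence;    /* may the call carry a mask?        */
    unsigned    default_mask;    /* used when the mask is omitted     */
} intrinsic_rule;

/* Apply the listed checks in order; 'r' is NULL when the name was not
   found in the table. */
static check_result check_call(const intrinsic_rule *r,
                               const call_arg *args, int argc,
                               int has_mask)
{
    if (r == NULL)
        return ERR_UNSUPPORTED;
    if (argc != r->argc)
        return ERR_ARG_COUNT;
    for (int i = 0; i < argc; i++) {
        if (args[i].type != r->types[i])
            return ERR_ARG_TYPE;
        if (args[i].is_constant &&
            (args[i].value < r->min[i] || args[i].value > r->max[i]))
            return ERR_RANGE;
    }
    if (has_mask && !r->allows_fence)
        return ERR_FENCE_NOT_ALLOWED;
    return CHECK_OK;
}

/* Mask the compiler should honor: the user's, or the table default. */
static unsigned effective_mask(const intrinsic_rule *r,
                               int has_mask, unsigned mask)
{
    return has_mask ? mask : r->default_mask;
}
```

Because every check reads its limits from the rule entry, supporting another intrinsic in this model requires new data only, not new checking code.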

3. Code Generation

The table 301 facilitates actual code generation (as shown
on FIG. 1) by assisting CG 105 in translation of high level
intermediate representation ("HIL") 103 to low level
intermediate representation ("LIL") 106. Specifically, the table
assists CG 105 in translating intrinsics previously
incorporated into source code 101. The table may also, when
processed into a part of CG 105, perform consistency
checking to recognize certain cases of incorrect HIL 103 that
were not caught by error checking in front end processing
("FE") 102.

Note that according to the present invention, it would also
be possible to generate a library of CG object files to assist
CG 105 in processing intrinsics, similar to library 505 that
assists FE 102, as illustrated on FIG. 5. Turning now to FIG.
4, and again noting that blocks 405 and 406 are explanatory
items and not part of the information flow, inline assembly
descriptor table 301 provides elements (c), (d) and
(e) as itemized above to software tool 400. Using this
information, software tool 400 generates CG source file
401₁, which in turn is compiled along with ordinary CG
source files 401₂-401ₙ (blocks 402) into CG object files
403₁-403ₙ. Archiver 404 accumulates CG object files
403₁-403ₙ into CG library 407.

In more detail now, the foregoing translation from HIL 
103 to LIL 106 for intrinsics includes the following phases: 
A. Generation of data structures

Automation at compiler-build time generates, for each
possible intrinsic operation, a data structure that contains
information on the types of the intrinsic arguments (if any)
and the type of the return value (if any).

B. Consistency checking 

At compiler-run time, a portion of CG that performs 
consistency checking on intrinsic operations can consult the 
appropriate data structure from A immediately above. This 
portion of CG does not need to be modified when a new 
intrinsic operation is added, unless the language in which the 
table 301 is written has changed. 

C. Translation from HIL to LIL 

Most intrinsic operations can be translated from HIL to 
LIL automatically, using information from the table. In a 
preferred embodiment, an escape mechanism is also advan- 
tageously provided so that an intrinsic operation that cannot 
be translated automatically can be flagged to be translated 
later by a hand-coded routine. The enablement of the escape 
mechanism does not affect automatic consistency checking. 

The representation of an intrinsic invocation in HIL 
identifies the intrinsic operation and has a list of arguments; 
there may be an implicit return value. The representation of 
an intrinsic invocation in LIL identifies a low-level operation 
and has a list of arguments. The translation process must 
retrieve information from the HIL representation and build 
up the LIL representation. There are a number of aspects to 
this mapping: 

i. The identity of the intrinsic operation in HIL may be 
expressed by one or more arguments in LIL. Information 
in element (e) in the inline assembly descriptor table set 
forth above is used to generate code expressing this 
identity in LIL. 

ii. The implicit return value (if any) from HIL is expressed 
as an argument in LIL. 

iii. Arguments of certain types in HIL must be translated to 
arguments of different types in LIL. The translation utility 
for any given argument type must be hand-coded, 
although the correct translation utility is invoked auto- 
matically by the translation process for the intrinsic 
operation. 

iv. The serialization mask (if any) from HIL is a special 
attribute (not an argument) in LIL. 

v. The LIL arguments must be emitted in the correct order. 
Information in element (e) in the inline assembly descrip- 
tor table as set forth above describes how to take the 
identity arguments from (i), the return value argument (if 
any) from (ii), and any other HIL arguments, and emit 
them into LIL in the correct order. 

For each possible intrinsic operation, the tool run at 
compiler-build time creates a piece of CG that takes as input 
the HIL form of that intrinsic operation and generates the 
LIL form of that intrinsic operation. 
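
The mapping in aspects (i), (ii), (iv) and (v) above can be sketched as a small interpreter over a per-intrinsic "recipe". Everything named here (hil_call, lil_insn, lil_recipe, slot, translate) is a hypothetical model, not the compiler's actual HIL or LIL data structures; operands are reduced to plain integers, and the per-type argument translation of aspect (iii) is omitted.

```c
#include <assert.h>

#define MAX_OPERANDS 8

/* Simplified HIL form of an intrinsic invocation. */
typedef struct {
    int      intrinsic_id;        /* which intrinsic is being called */
    int      argc;
    int      args[MAX_OPERANDS];  /* operands, as abstract value ids */
    int      has_result;
    int      result;              /* implicit return value, if any   */
    unsigned fence_mask;          /* serialization mask, 0 if none   */
} hil_call;

/* Simplified LIL form: the mask is an attribute, not an operand. */
typedef struct {
    int      opcode;
    int      argc;
    int      args[MAX_OPERANDS];
    unsigned fence_mask;
} lil_insn;

/* Where each LIL operand comes from, in emission order. This plays the
   role of the ordering information held in table element (e). */
typedef enum { SLOT_IDENTITY, SLOT_RESULT, SLOT_HIL_ARG } slot_kind;

typedef struct {
    slot_kind kind;
    int       value;  /* identity constant, or index of a HIL argument */
} slot;

typedef struct {
    int  opcode;
    int  nslots;
    slot slots[MAX_OPERANDS];
} lil_recipe;

/* Build the LIL instruction for one HIL call by following the recipe. */
static lil_insn translate(const hil_call *h, const lil_recipe *r)
{
    lil_insn out = { r->opcode, 0, { 0 }, h->fence_mask };
    assert(r->nslots <= MAX_OPERANDS);
    for (int i = 0; i < r->nslots; i++) {
        const slot *s = &r->slots[i];
        switch (s->kind) {
        case SLOT_IDENTITY:              /* aspect (i)  */
            out.args[out.argc++] = s->value;
            break;
        case SLOT_RESULT:                /* aspect (ii) */
            assert(h->has_result);
            out.args[out.argc++] = h->result;
            break;
        case SLOT_HIL_ARG:               /* remaining HIL arguments */
            assert(s->value < h->argc);
            out.args[out.argc++] = h->args[s->value];
            break;
        }
    }
    return out;
}
```

Two intrinsics whose recipes differ only in the identity constants can share this one routine, which is the kind of reuse described in the following paragraphs.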

In a preferred embodiment, the tool run at compiler-build 
time advantageously recognizes when two or more intrinsic 
operations are translated using the same algorithm, and 
generates a single piece of code embodying that algorithm 
that can perform translations for all of those intrinsic opera- 
tions. When this happens, information on the identity of the 
intrinsic operation described in (i) above is stored in the 
same data structures described in A further above, so that the 
translation code can handle the multiple intrinsic operations. 
In the preferred embodiment, translation algorithms for two 
intrinsic operations are considered "the same" if all of the 
following hold: 

The HIL forms of the operations have the same number of 
arguments of the same types in the same order. 

The HIL forms of the operations either both lack a return 
value or have the same return type. 

The identity information is expressed in the LIL forms of 
the operations using the same number of arguments of 
the same types. 
The LIL arguments for the operations occur in the same
order.

In summary, within the internal program representations
used by the compiler, the inlining of assembly instructions
may be implemented as special calls in the HIL that the front
end generates. Every assembly instruction supported by
inlining is defined as part of this intermediate language.

When an inlined assembly instruction is encountered in
the source, after performing error checking, the FE would
emit, as part of the HIL, a call to the corresponding
dedicated HIL routine.

The CG then replaces each such call in the HIL with the
corresponding machine instruction in the LIL which is then
subject to optimizations by the LLO, without violating any
associated serialization constraint specifiers (as discussed
above).

In addition to facilitating code generation from HIL to
LIL, the table-driven approach advantageously assists code
generation in other phases of the compiler. For example, and
with reference again to FIG. 1, the table could also be
extended to generate part of HLO 104 or LLO 107 for
manipulating assembly intrinsics (or to generate libraries to
be used by HLO 104 or LLO 107). This could be
accomplished, for instance, by having the table provide
semantic information on the intrinsics that indicates
optimization freedom and optimization constraints. Although the
greatest benefit comes from using the table for as many
compiler stages as possible, this approach applies equally
well to a situation in which only some of the compiler stages
use the table, for example where neither HLO 104 nor
LLO 107 uses the table.

Although the preferred embodiment does Library
Generation and Partial Code Generator Generation (as described
above) at compiler-build time, it would not be substantially
different for FE 102, CG 105, or some library to consult the
table (or some translated form of the table) at compiler-run
time instead.

Furthermore, although this approach has been disclosed to
apply to assembly intrinsics, it could equally well be applied
to any set of operations where there is at least one compiler
stage that takes a set of operations in a regular form and
translates them into another form, where the translation
process can occur in a straightforward and automated
fashion.

Each time a new intrinsic operation needs to be added to
the compiler, a new entry is added to the table of intrinsic
operations. A compiler stage that relies on the table-driven
approach usually need not be modified by hand in order to
manipulate the new intrinsic operation (the exception is if
the language in which the table itself is written has to be
extended, for example to accommodate a new argument
type or a new return type; in such a case it is likely that
compiler stages and automation that processes the table will
have to be modified). Reducing the amount of code that must
be written by hand makes it simpler and quicker to add
support for new intrinsic operations, and reduces the
possibility of error when adding new intrinsic operations.

A further advantageous feature enabled by the present
invention is that key library routines may now access
machine instruction-level code so as to optimize run-time
performance. Performance-critical library routines (e.g.
math or graphics library routines) often require access to
low-level machine instructions to achieve maximum
performance on modern processors. In the current art, they are
typically hand-coded in assembly language.

As traditionally performed, hand-coding of assembly
language has many drawbacks. It is inherently tedious, it
requires detailed understanding of microarchitecture
performance characteristics, it is difficult to do well and is
error-prone, the resultant code is hard to maintain, and, to achieve
optimal performance, the code requires rework for each new
implementation of the target architecture.

In a preferred embodiment of the present invention,
performance-critical library routines may now be coded in
high-level languages, using embedded machine instructions
as needed. Such routines may then be compiled into an
object file format that is amenable to cross-module
optimization and linking in conjunction with application code that
invokes the library routines. Specifically, the library routines
may be inlined at the call sites in the application program
and optimized in the context of the surrounding code.

With reference to FIG. 7, intrinsics defined in "inline.h"
system header file SH2 enable machine instructions to be
embedded, for example, in math library routine source code
702s. This "mathlib" source code 702s is then compiled in
accordance with the present invention into equivalent object
code 702o. Meanwhile, source code 701s wishing to invoke
the functionality of "mathlib" is compiled into object code
701o in the traditional manner employed for cross-module
optimization. Cross-module optimization and linking
resources 704 then combine the two object codes 701o and
702o to create optimized executable code 705.

In FIG. 7, it should be noted that the math library is
merely used as an example. There are other analogous
high-performance libraries for which the present invention
brings programming advantages, e.g., for graphics,
multimedia, etc.

In addition to easing the programming burden on library
writers, the ability to embed machine instructions into
source code spares the library writers from having to
re-structure low-level hand-coded assembly routines for
each implementation of the target architecture.

Floating Point ("_fpreg") Data Type

The description of a preferred embodiment has so far
centered on the inventive mechanism disclosed herein for
inlining machine instructions into the compilation and
optimization of source code. It will be appreciated that this
mechanism will often be called upon to compile objects that
include floating-point data types. A new data type is also
disclosed herein, named "_fpreg" in the C programming
language, which allows general programmatic access
(including via the inventive machine instruction inlining
mechanism) to the widest mode floating-point arithmetic
supported by the processor. This data type corresponds to a
floating-point representation that is as wide as the
floating-point registers of the underlying processor. It will be
understood that although discussion of the inventive data type
herein centers on "_fpreg" as named for the C programming
language, the concepts and advantages of the inventive data
type are applicable in other programming languages via
corresponding data types given their own names.

A precondition to fully enabling the "_fpreg" data type is
that the target processor must of course be able to support
memory access instructions that can transfer data between
its floating-point registers and memory without loss of range
or precision.

Depending on the characteristics of the underlying
processor, the "_fpreg" data type may be defined as a data
type that either requires "active" or "passive" conversion.
The distinction here is whether instructions are emitted
when converting a value of "_fpreg" data type to or from a
value of another floating-point data type. In an active
conversion, a machine instruction would be needed to effect
the conversion whereas in a passive conversion, no machine
instruction would be needed. In either case, the memory
representation of an object of "_fpreg" data type is defined
to be large enough to accommodate the full width of the
floating-point registers of the underlying processor.

The type promotion rules of the programming language 
are advantageously extended to accommodate the _fpreg 
data type in a natural way. For example, for the C program- 
ming language, it is useful to assert that binary operations 
involving this type shall be subject to the following promo- 
tion rules: 

1. First, if either operand has type _fpreg, the other operand
is converted to _fpreg.

2. Otherwise, if either operand has type long double, the 
other operand is converted to long double. 

3. Otherwise, if either operand has type double, the other 
operand is converted to double. 

Note that in setting the foregoing exemplary promotion 
rules, it is assumed that the _fpreg data type which corre- 
sponds to the full floating-point register width of the target 
processor has greater range and precision than the long 
double data type. If this is not the case, then the first two 
rules may need to be swapped in sequence. 
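
As a small illustration, the three rules can be written as a function over an enumeration of floating-point types. The enumeration fp_type and the function promote are hypothetical names used only for this sketch, and the final fall-through to float for two float operands is an assumption, since the rules above do not address that case. The sketch also assumes, as the text does, that _fpreg is the widest type.

```c
/* Floating-point types taking part in a binary operation. */
typedef enum { FT_FLOAT, FT_DOUBLE, FT_LONG_DOUBLE, FT_FPREG } fp_type;

/* Type in which a binary operation on operands of types a and b is
   evaluated, following rules 1 to 3 in order. */
static fp_type promote(fp_type a, fp_type b)
{
    if (a == FT_FPREG || b == FT_FPREG)
        return FT_FPREG;            /* rule 1 */
    if (a == FT_LONG_DOUBLE || b == FT_LONG_DOUBLE)
        return FT_LONG_DOUBLE;      /* rule 2 */
    if (a == FT_DOUBLE || b == FT_DOUBLE)
        return FT_DOUBLE;           /* rule 3 */
    return FT_FLOAT;                /* assumed: both operands are float */
}
```

On a target where long double is wider than the floating-point registers, the first two tests would be swapped, as noted above.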

Note also that in general, assuming type _fpreg has
greater range and/or precision than type long double, it may
be that the result of computations involving _fpreg values 
cannot be represented precisely as a value of type long 
double. The behavior of the type conversion from _fpreg to 
long double (or to any other source-level floating-point type) 
must therefore be accounted for. A preferred embodiment 
employs a similar rule to that used for conversions from 
double to float: If the value being converted is out of range, 
the behavior is undefined; and if the value cannot be 
represented exactly, the result is either the nearest higher or 
the nearest lower representable value. 

It will be further appreciated that the application and 
availability of the _fpreg data type is not required to be 
universal within the programming language. Depending on 
processor architecture and programmer needs, it is possible 
to limit availability of the _fpreg data type to only a subset 
of the operations that may be applied to other floating-point 
types. 

To illustrate general programming use of this new data
type, consider the following C source program that
computes a floating-point 'dot-product' (a*b+c):

double a, b, c, d;

main ( )
{
    d = (a * b) + c;
}



where the global variable d is assigned the result of the 
dot-product. For this example, according to the standard 
"usual arithmetic conversion rule" of the C programming 
language, the floating-point multiplication and addition 
expressions will be evaluated in the "double" data type using 
double precision floating-point arithmetic instructions. 
However, in order to exploit greater precision afforded by a 
processor with floating-point registers whose width exceeds 
that of the standard double data type, the _fpreg data type 
may alternatively be used as shown below: 
double a, b, c, d;

main ( )
{
    d = ((_fpreg) a * b) + c;
}

Note here that the variable "a" of type double is "typecast"
into an _fpreg value. Hence, based on the previously
mentioned extension to the usual arithmetic conversion rule,
the variables "a", "b", and "c" of "double" type are
converted (either passively or actively) into "_fpreg" type
values and both the multiplication and addition operations
will operate in the maximum floating-point precision
corresponding to the full width of the underlying floating-point
registers. In particular, the intermediate maximum precision
product of "a" and "b" will not need to be rounded prior to
being summed with "c". The net result is that a more
accurate dot-product value will be computed and round-off
errors are limited to the final assignment to the variable "d".
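
The effect of deferring the rounding can be observed in standard C by using long double as a stand-in for _fpreg. This is only an approximation of the mechanism described here: long double is an assumption of this sketch, it is not guaranteed to match the register width, and on some platforms it is no wider than double. The function names dot_rounded and dot_wide are hypothetical.

```c
/* Double-precision evaluation: the product is rounded to double before
   the addition. The volatile store forces that rounding even on targets
   that would otherwise keep extra precision or fuse the two operations. */
static double dot_rounded(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

/* Wider evaluation: the product and the sum are computed in long double
   and only the final result is rounded to double. */
static double dot_wide(double a, double b, double c)
{
    long double wide = (long double)a * b + c;
    return (double)wide;
}
```

For a = b = 1 + 2^-30 and c = -(1 + 2^-29), the exact result is 2^-60. dot_rounded returns 0 on IEEE double-precision targets, because the 2^-60 term is lost when the product is rounded to double, whereas dot_wide returns 2^-60 on platforms whose long double has a significand of at least 64 bits.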

Applying the foregoing features and advantages of the
_fpreg data type to the inventive mechanism disclosed
herein for inlining machine instructions, it will be seen that
the parameters and return values of intrinsics specified in
accordance with that mechanism may be declared to be of
this data type when such intrinsics correspond to floating
point instructions.

For example, in order to allow source-level embedding of
a floating-point fused-multiply add instruction:

fma fr4 = fr1, fr2, fr3

that sums the product of the values contained in 2
floating-point register source operands (fr1 and fr2) with the value
contained in another floating-point register source operand
(fr3), and writes the result to a floating-point register (fr4),
the following inline assembly intrinsic can be defined:

fr4 = _fpreg _Asm_fma (_fpreg fr1, _fpreg fr2, _fpreg fr3)

Now, following the general programmatic example used
above, this intrinsic can be used to compute a floating-point
"dot-product" (a*b+c) in a C source program as follows:

double a, b, c, d;

main ( )
{
    d = _Asm_fma (a, b, c);
}

where d is assigned the result of the floating-point
computation ((a * b)+c).

Note that the arguments to _Asm_fma (a, b, and c) are
implicitly converted from type double to type __fpreg when
invoking the intrinsic, and that the intrinsic return value of
type __fpreg is implicitly converted to type double for
assignment to d. As discussed above, if type __fpreg has
greater range and/or precision than type double, it may be
that the result of the intrinsic operation (or indeed any other
expression of type __fpreg) cannot be represented precisely
as a value of type double. The behavior of the type
conversion from __fpreg to double (or to any other source-level
floating-point type, such as float) must therefore be
accounted for. In a preferred embodiment, a rule similar to
that used for conversions from double to float is employed:
If the value being converted is out of range, the behavior is
undefined; and if the value cannot be represented exactly, the
result is either the nearest higher or nearest lower
representable value.

If the result of the dot-product were to be used in a
subsequent floating-point operation, it would be possible to
minimize loss of precision by carrying out that operation in
type __fpreg as follows:

double a, b, c, d, e, f, g;

main ()
{
    __fpreg x, y;

    x = _Asm_fma (a, b, c);
    y = _Asm_fma (e, f, g);
    d = x + y;
}

Note that the results of the two dot-products are stored in
variables of type __fpreg; the results are summed (still in
type __fpreg), and this final sum is then converted to type
double for assignment to d. This should produce a more
precise result than storing the dot-product results in
variables of type double before summing them. Also, note that
the standard binary operator + is being applied to values of
type __fpreg to produce an __fpreg result (which, as
previously stated, must be converted to type double for
assignment to d).

Conclusion

It will be further understood that the present invention
may be embodied in software executable on a general
purpose computer including a processing unit accessing a
computer-readable storage medium, a memory, and a
plurality of I/O devices.

Although the present invention and its advantages have
been described in detail, it should be understood that various
changes, substitutions and alterations can be made herein
without departing from the spirit and scope of the invention
as defined by the appended claims.

We claim:

1. A method for enabling a compiler to generate
preselected machine instructions responsive to source level
specification thereof, the method comprising the steps of:

(a) predefining a series of source level intrinsics each
having arguments and return values selectively
associable therewith, each intrinsic indexed to preselected
machine instructions, said machine instructions to be
generated upon compiler recognition of the
corresponding intrinsic, said series of intrinsics further being
extensible to facilitate table-driven addition of
supplemental intrinsics thereto;

(b) selectively specifying said intrinsics in source code;

(c) selectively associating arguments with ones of said
specified intrinsics, ones of said arguments representing
serialization constraints to be honored by the compiler
when optimizing said generated machine instructions;

(d) compiling the source code, said step (d) including the
substeps of:

(i) recognizing said specified intrinsics in the source
code; and

(ii) generating machine instructions corresponding to
said recognized intrinsics, said generated machine
instructions operating on user-selectable operands of
unrestricted type, said substep (d)(ii) further including
inlining library functions, the library functions
invoked by the source code and containing machine
instructions;

(e) optimizing said generated machine instructions
according to serialization constraints represented by
arguments associated with said recognized intrinsics;
and

(f) when said generated machine instructions are
floating-point instructions, selectively accessing the full range
and precision of internal floating-point register
representations corresponding thereto.

2. The method of claim 1, in which ones of said
serialization constraints are enabled by combining together a set
of constant bit-masks.

3. The method of claim 1, in which further ones of said
arguments associated in step (c) represent completers to be
honored by the compiler in generating machine instructions
in step (d)(ii).

4. The method of claim 1, in which a predetermined file
includes declarations that define symbolic constants, the
symbolic constants disposed to be used as parameters to
selected intrinsic calls.

5. The method of claim 4, in which the predetermined file
is a system header file.

6. The method of claim 4, in which the predetermined file
is generated from a table.

7. The method of claim 6, in which the table is consulted
at a processing time selected from the group consisting of (1)
compiler build time, and (2) compiler run time.

8. The method of claim 6, in which each intrinsic has a
unique identifier including an opcode, the opcode serving as
a key to referencing the table.

9. The method of claim [illegible], in which the table has a
plurality of elements, each of said elements representing
selected characteristics of intrinsics.

10. The method of claim 9, in which at least one element
is a name for each intrinsic tabulated in the table, and at least
one other element is selected from the group consisting of:

(a) a corresponding description;

(b) a corresponding set of valid argument names and types
legally associable therewith; and

(c) a corresponding set of legal return values expected
from execution thereof.

11. The method of claim 6, in which the table provides
legal operands and return values to support semantic
validation of source level intrinsic specification.

12. The method of claim 1, in which steps (b) and (c) are
enabled using a natural source specification syntax.

13. The method of claim 1, in which each intrinsic has a
unique identifier comprising:

a prefix, the prefix signifying membership of the intrinsic
in said predefined series thereof; and

an opcode, the opcode indicating the preselected machine
instructions to which the intrinsic is indexed.

14. The method of claim 13, in which the prefix is _Asm.

15. The method of claim 1, in which step (d) further
includes the substep of validating the semantic accuracy of
the source code specified in steps (b) and (c).

16. The method of claim 1, in which at least one of said
library functions is selected from the group consisting of:

(a) a math library function; and

(b) a graphics library function.

17. The method of claim 1, in which the source code is
written in a programming language selected from the group
consisting of:

(a) C;

(b) C++; and 

(c) FORTRAN. 

18. An improved computer program compiler disposed to 
transform source level code into a lower level representation 
thereof, the lower level representation including predeter- 
mined segments of lower level code directly specified at said 
source level, the improvement comprising: 

a predefined set of source level intrinsics, each intrinsic
indexed to corresponding ones of said predetermined
segments of lower level code; and
the compiler including:

means for recognizing said intrinsics when encountered
during compilation of source level code;
means for generating segments of lower level code
corresponding to said recognized intrinsics; and
means for optimizing said generated segments accord-
ing to serialization constraints represented by argu-
ments associated with said recognized intrinsics.

19. The improved compiler of claim 18, in which the set
is also extensible to facilitate table-driven addition of
supplemental intrinsics thereto.

20. The improved compiler of claim 18, in which the 
means for generating includes means for inlining library 
functions, the library functions invoked at said source level 
and containing machine instructions. 

21. The improved compiler of claim 18, in which said 
generated segments operate on user-selectable operands of 
unrestricted type. 

22. The improved compiler of claim 18, in which floating- 
point machine instructions in said generated segments may 
access the full range and precision of internal floating-point 
register representations corresponding thereto. 

23. The improved compiler of claim 18, in which the 
lower level is a machine language. 

24. The improved compiler of claim 18, in which said
source level code is written in a language selected from
the group consisting of:

(a) C; 

(b) C++; and 

(c) FORTRAN. 

25. The improved compiler of claim 18, in which said 
serialization constraints are enabled by combining together 
a set of constant bit-masks. 

26. A computer program compiler disposed to generate 
preselected machine instructions responsive to source level 
specification thereof, comprising: 

means for predefining a series of source level intrinsics, 
each intrinsic indexed to preselected machine 
instructions, said machine instructions to be generated 
upon compiler recognition of the corresponding 
intrinsic, said series of intrinsics further being exten-
sible to facilitate table-driven addition of supplemental
intrinsics thereto; 

means for selectively specifying said intrinsics in source 
code; and 

means for compiling the source code, the means for 
compiling further including: 

means for recognizing said specified intrinsics in the 
source code; and 

means for generating machine instructions correspond- 
ing to said recognized intrinsics. 

27. The compiler of claim 26, in which the means for
compiling further includes means for optimizing said gen-
erated machine instructions according to serialization con-
straints represented by arguments associated with said rec-
ognized intrinsics.

28. The compiler of claim 26, in which the means for 
generating includes means for inlining library functions, the 
library functions invoked at said source level and containing 
machine instructions. 

29. The compiler of claim 26, in which said generated 
machine instructions operate on user-selectable operands of 
unrestricted type. 

30. The compiler of claim 26, in which floating-point 
machine instructions in said generated machine instructions 
may access the full range and precision of internal
floating-point register representations corresponding thereto.

31. A method for enabling a compiler to generate
preselected machine instructions responsive to source level
specification thereof, the method comprising the steps of:

(a) predefining a series of source level intrinsics each 
having arguments and return values selectively asso- 
ciable therewith, each intrinsic indexed to preselected 
machine instructions, said machine instructions to be 
generated upon compiler recognition of the corre- 
sponding intrinsic; 

(b) specifying ones of said intrinsics in source code; and 

(c) compiling the source code, said step (c) including the 
substeps of: 

(i) recognizing said specified intrinsics in the source 
code; and 

(ii) generating machine instructions corresponding to 
said recognized intrinsics; and 

(d) when said generated machine instructions are floating- 
point instructions, accessing the full range and preci- 
sion of internal floating-point register representations 
corresponding thereto. 

32. The method of claim 31, in which step (c) further 
includes the substep of optimizing said generated machine 
instructions according to serialization constraints repre- 
sented by arguments associated with said recognized intrin- 
sics. 

33. The method of claim 31, in which the series is also 
extensible to facilitate table-driven addition of supplemental 
intrinsics thereto. 

34. The method of claim 31, in which said generated 
machine instructions operate on user-selectable operands of 
unrestricted type. 

35. A computer program product including computer 
readable logic recorded thereon for enabling a compiler to 
generate preselected machine instructions responsive to 
source-level specification thereof, the computer program 
product comprising: 

a computer-readable storage medium; and 
a computer program stored on the computer-readable 
storage medium and operable on source code contain- 
ing extensible intrinsics selected from a predefined set 
thereof, each intrinsic in the set indexed to preselected 
machine instructions, said preselected machine instruc- 
tions to be generated upon compiler recognition of the 
corresponding intrinsic in the source code, the com- 
puter program comprising: 

means for compiling the source code, said means for 
compiling further including: 
means for recognizing intrinsics specified in the
source code; and
means for generating machine instructions corre-
sponding to said recognized intrinsics, said gen-
erated machine instructions operating on user-
selectable operands of unrestricted type.

36. The computer program product of claim 35, in which
the operands include general arithmetic expressions.

37. The computer program product of claim 35, in which
the means for compiling further includes means for
optimizing said generated machine instructions according to
serialization constraints represented by arguments
associated with said recognized intrinsics.

38. The computer program product of claim 37, in which
ones of said serialization constraints are enabled by
combining together a set of constant bit-masks.

39. The computer program product of claim 37, in which
the means for generating also honors completers represented
by arguments associated with said recognized intrinsics.

40. The computer program product of claim 35, in which
the set is also extensible to facilitate table-driven addition of
supplemental intrinsics thereto.

41. The computer program product of claim 35, in which
floating-point machine instructions in said generated
machine instructions may access the full range and precision
of internal floating-point register representations
corresponding thereto.

42. The computer program product of claim 35, in which
the means for generating further includes means for inlining
library functions, the library functions invoked by the source
code and containing machine instructions.

43. The computer program product of claim 35, in which
the computer program is further operable in conjunction
with a predetermined file, the predetermined file including
declarations that define symbolic constants, the symbolic
constants disposed to be used as parameters to selected
intrinsic calls.

44. The computer program product of claim 43, in which
the predetermined file is a system header file.

45. The computer program product of claim 43, in which
the predetermined file is generated from a table.

46. The computer program product of claim 45, in which
the table has a plurality of elements, one of said elements
representing a name for each intrinsic tabulated therein, at
least one other element selected from the group consisting
of:

(1) a corresponding description;

(2) a corresponding set of valid argument names and types
legally associable therewith; and

(3) a corresponding set of legal return values expected
from execution thereof.

47. A method for enabling a compiler to generate
preselected machine instructions responsive to source level
specification thereof, the method comprising the steps of:

(a) predefining a series of extensible source level
intrinsics each having arguments selectively associable
therewith, each intrinsic indexed to preselected
machine instructions, said machine instructions to be
generated upon compiler recognition of the
corresponding intrinsic;

(b) selectively specifying said intrinsics in source code;

(c) selectively associating arguments with ones of said
specified intrinsics, said arguments selected from the
group consisting of:

(i) completers to be honored by the compiler when
generating machine instructions;

(ii) operands on which generated machine instructions
are to operate, the operands being user-selectable of
unrestricted type, the operands further including
floating-point operands, the floating-point operands
corresponding to a floating-point representation in
registers in an underlying processor that is as wide as
said registers; and

(iii) serialization constraints to be honored by the
compiler when optimizing generated machine
instructions, the serialization constraints enabled by
combining together a set of constant bit-masks;

(d) compiling the source code, said step (d) including the
substeps of:

(i) recognizing said specified intrinsics in the source
code;

(ii) validating the semantic accuracy of said specified
intrinsics via reference to legal values contained in a
predefined table, the table having a plurality of
elements, one of said elements representing a name
for each intrinsic tabulated therein, at least one other
element selected from the group consisting of:

(1) a corresponding description;

(2) a corresponding set of valid argument names and
types legally associable therewith; and

(3) a corresponding set of legal return values
expected from execution thereof; and

(iii) generating machine instructions corresponding to
said recognized intrinsics, said generated machine
instructions disposed to be executed as object code
according to arguments associated with said
recognized intrinsics; and

(e) causing the table also to generate a system header file,
the system header file including declarations that define
symbolic constants, the symbolic constants disposed to
be used as parameters to selected intrinsic calls.

48. The method of claim 47, in which said intrinsic calls
include inlining library functions containing embedded
machine instructions.

49. The method of claim 47, in which the system header
file generates machine instructions by further reference to a
library thereof.

50. The method of claim 47, in which the table is
consulted in step (d)(ii) at a processing time selected from
the group consisting of (1) compiler build time, and (2)
compiler run time.

51. The method of claim 47, in which steps (b) and (c) are
enabled using a natural source specification syntax.

52. The method of claim 47, in which each intrinsic has
a unique identifier comprising:

a prefix, the prefix signifying membership of the intrinsic
in said predefined series thereof; and

an opcode, the opcode indicating the preselected machine
instructions to which the intrinsic is indexed.

53. The method of claim 52, in which the prefix is _Asm.

54. The method of claim 47, in which step (d) is
performed in an environment in which cross-module
optimization is enabled, and in which:

first source code is optimized in combination with second
source code to obtain combined optimized object code,
said first and second source codes each specifying
intrinsics, said first source code specifying at least one
library function, said second source code invoking a
library function specified in said first source code.

55. The method of claim 54, in which said library function
is selected from the group consisting of:

(a) a math library function; and

(b) a graphics library function.

56. The method of claim [illegible], in which the source code
is written in a programming language selected from the group
consisting of:

(a) C;

(b) C++; and

(c) FORTRAN.

* * * * *